SNP Haplotype discovery and analysis workflow
For family-based genetic studies it is important to use highly polymorphic markers, since 
parents are only informative if they are heterozygous. The expected frequency of heterozygotes
at a bi-allelic SNP cannot exceed 50%, whereas the frequency of heterozygotes at more 
polymorphic markers can approach 100%. SNP haplotypes have up to 2^n possible alleles, where n 
is the number of SNPs, and can therefore be used where highly polymorphic markers are 
required.
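
The bound can be checked numerically. The Python sketch below is illustrative only and is not
part of the pipeline. It assumes Hardy-Weinberg proportions, under which the expected
heterozygosity is 1 minus the sum of squared allele frequencies. The function names are
hypothetical.

```python
def expected_heterozygosity(freqs):
    """Expected heterozygosity H = 1 - sum(p_i^2), assuming Hardy-Weinberg proportions."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in freqs)


def max_heterozygosity(n_snps):
    """Upper bound on H for a haplotype built from n_snps bi-allelic SNPs.

    The bound is reached only if all 2**n_snps haplotypes are equally
    frequent; SNPs in linkage disequilibrium will give a lower value.
    """
    k = 2 ** n_snps
    return expected_heterozygosity([1.0 / k] * k)


if __name__ == "__main__":
    # Print: number of SNPs, number of possible haplotypes, maximum H
    for n in (1, 2, 3, 5):
        print(n, 2 ** n, max_heterozygosity(n))
```

A single SNP gives a maximum of 0.5, and each additional SNP halves the remaining gap to 1.
In practice the SNPs within a block are correlated, so far fewer than 2^n haplotypes are
observed and the realised heterozygosity is lower than this bound.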

We have developed a pipeline for the discovery of SNP haplotypes in candidate genes, for use 
in candidate gene studies. The pipeline was developed using the following candidate 
genes: IFNG, IL10, IL13, IL4, IL5, STAT6, CTLA4, FCN2, COLEC11, ABO, IL17A, IL17B, CRP,
IL6R, IL17F, IL9, CD14, CXCL14, IL3, IL12B, VEGFA, CTGF, IL22RA2, NOS3 and SHH. 
Each candidate gene was evaluated in the five African populations from the 1000 Genomes 
dataset (GWD, MSL, YRI, ESN, LWK), both individually and combined.

The pipeline consists of one wrapper script (runBigLD.sh), which creates input files for
each population and gene and then:
1) calls runBigLD.R, which identifies haplotype block boundaries and makes a heatmap of 
the LD between the SNPs in the gene. 
2) calls makeHaplotypes.pl, which obtains the SNPs in each haplotype block, finds the 
sequence of each haplotype in the block and calculates the frequency of each haplotype. 
It outputs two files (where POP is a population name): 
	1) Summary.HapStats.1kg.h3a.POP.txt containing summary statistics on each haplotype 
	block in each gene in a population
	2) All.HapsStats.1kg.h3a.POP.txt containing the sequence of each haplotype in each
	 haplotype block in each candidate gene and counts of observations of each haplotype.
The output from makeHaplotypes.pl is included for each population and for all populations 
combined (ALL):
	Summary.HapStats.1kg.h3a.GWD.txt
	Summary.HapStats.1kg.h3a.MSL.txt
	Summary.HapStats.1kg.h3a.YRI.txt
	Summary.HapStats.1kg.h3a.ESN.txt
	Summary.HapStats.1kg.h3a.LWK.txt
	Summary.HapStats.1kg.h3a.ALL.txt
	All.HapsStats.1kg.h3a.GWD.txt
	All.HapsStats.1kg.h3a.MSL.txt
	All.HapsStats.1kg.h3a.YRI.txt
	All.HapsStats.1kg.h3a.ESN.txt
	All.HapsStats.1kg.h3a.LWK.txt
	All.HapsStats.1kg.h3a.ALL.txt
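
The counting step performed on each haplotype block can be sketched as follows. This is a
minimal Python illustration, not the code in makeHaplotypes.pl. It assumes the data are
already phased and that each haplotype is supplied as a string of alleles across the SNPs in
the block. The function names and toy data are hypothetical.

```python
from collections import Counter


def haplotype_frequencies(haplotypes):
    """Return (sequence, count, frequency) rows, most common haplotype first."""
    counts = Counter(haplotypes)
    total = sum(counts.values())
    return [(seq, n, n / total) for seq, n in counts.most_common()]


def block_heterozygosity(haplotypes):
    """Expected heterozygosity of the block: 1 - sum of squared haplotype frequencies."""
    rows = haplotype_frequencies(haplotypes)
    return 1.0 - sum(freq ** 2 for _, _, freq in rows)


if __name__ == "__main__":
    # Toy block of three SNPs observed on four chromosomes (made-up data)
    toy = ["ACG", "ACG", "ATG", "GCA"]
    for seq, n, freq in haplotype_frequencies(toy):
        print(seq, n, freq)
    print("H =", block_heterozygosity(toy))
```

The per-haplotype rows correspond loosely to the contents of the All.HapsStats files and the
per-block heterozygosity to the kind of statistic a Summary.HapStats file might hold. The
exact columns written by makeHaplotypes.pl are not reproduced here.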

Input files for runBigLD.sh:
1) A single set of PLINK bed, bim and fam files with data for all the genes and 
populations of interest. In this case we used SchistoCandGenes.1kg.h3a* 
2) genes.ensGRCh37.txt, a file with the co-ordinates of all genes of interest. This can be a 
file of co-ordinates of all genes in the genome downloaded from BioMart or UCSC; the script 
will extract the relevant rows. It is important that the chromosome number, gene start 
and gene end are in columns 1, 2 and 3 and that the gene name is in column 5. Additional 
columns will be ignored.
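
The column layout described above can be illustrated with a short Python sketch that pulls
the rows for a set of genes out of a genome-wide co-ordinate file. It is not the code used by
runBigLD.sh. It assumes the file is tab-delimited, and the contents of column 4 and the
co-ordinates in the example are made up.

```python
import csv
import io


def select_genes(lines, wanted):
    """Yield (chrom, start, end, gene) for rows whose column 5 is in `wanted`.

    Columns 1, 2 and 3 hold chromosome, gene start and gene end; column 5
    holds the gene name. Any other columns are ignored.
    """
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 5:
            continue  # skip blank or truncated lines
        chrom, start, end, gene = row[0], row[1], row[2], row[4]
        if gene in wanted:
            yield chrom, int(start), int(end), gene


if __name__ == "__main__":
    # Made-up co-ordinates, for illustration only
    example = io.StringIO(
        "chrom\tstart\tend\tid\tname\n"
        "12\t1000\t2000\tX1\tIFNG\n"
        "1\t3000\t4500\tX2\tIL10\n"
        "7\t5000\t6000\tX3\tOTHER\n"
    )
    for record in select_genes(example, {"IFNG", "IL10"}):
        print(record)
```

A header line is skipped naturally, because its column 5 does not match any requested gene
name.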

Ancillary Scripts:
1) replaceFamIds.pl, a Perl script to modify 1000 Genomes sample ids so that the ids 
include a three-letter population identifier. This makes it possible to extract populations 
from the fam file with grep. It uses the ids and population names in OneKg.pops and takes a 
fam file with 1000 Genomes ids as its input parameter. 
e.g. to run the script: "perl replaceFamIds.pl filename.fam"
2) plotBlocks.R, an R script to make plots of the haplotype blocks found by BigLD. It uses the 
output files from makeHaplotypes.pl as input. In addition it needs the file of exon 
co-ordinates, exons.ensGRCh37.txt.
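
The id relabelling done by replaceFamIds.pl can be sketched in Python as below. This is an
illustration under stated assumptions, not a port of the Perl script. It assumes that
OneKg.pops has a sample id and a population code in its first two whitespace-separated
columns, and that the population code is prefixed to both the family id and the individual id
with an underscore. The real file layout and id format may differ, and the sample ids shown
are made up.

```python
def load_pops(lines):
    """Map sample id -> population code from two-column lines (assumed layout)."""
    pops = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            pops[fields[0]] = fields[1]
    return pops


def relabel_fam(lines, pops, sep="_"):
    """Prefix family and individual ids in fam lines with the population code.

    A PLINK fam line has six columns: family id, individual id, father,
    mother, sex and phenotype. Samples missing from `pops` are left unchanged.
    """
    out = []
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # not a valid fam line
        pop = pops.get(fields[1])
        if pop is not None:
            fields[0] = f"{pop}{sep}{fields[0]}"
            fields[1] = f"{pop}{sep}{fields[1]}"
        out.append(" ".join(fields))
    return out


if __name__ == "__main__":
    pops = load_pops(["S0001 GWD", "S0002 YRI"])  # made-up ids
    fam = ["S0001 S0001 0 0 1 -9", "S0002 S0002 0 0 2 -9"]
    relabelled = relabel_fam(fam, pops)
    # Equivalent of extracting one population with grep
    print([line for line in relabelled if line.split()[1].startswith("GWD_")])
```

Once the population code is part of the id, selecting a population reduces to a simple text
match on the fam file, which is what makes the grep step possible.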